Skip to content

Makes document attachment usage consider external attachments #1600

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Jun 5, 2025

Conversation

Spoffy
Copy link
Contributor

@Spoffy Spoffy commented May 5, 2025

Context

Documents' attachment usage currently only considers files which are stored internally.

If a document has external attachments, they won't show in the document's "Size of attachments" bar in "Raw Data".

Proposed solution

This updates the attachment usage calculations to use _grist_Attachments.fileSize for external attachments, while preserving the existing logic for internal attachments.

Missing attachments

If a document has external attachments and is uploaded to a Grist instance, these missing attachments will still count towards the document's attachment usage. There's no easy way to mitigate this in all situations without checking all attachments on document open.

Untrusted file size value

This creates a problem with imported documents, which may have an incorrect fileSize value if it has been manually altered (i.e - the value is untrusted).

To mitigate this, the fileSize value is now updated when attachments are uploaded as a .tar archive. This means that eventually, fileSize is always valid for files which are present in external storage.

Related issues

#1021

Has this been tested?

  • 👍 yes, I added tests to the test suite

@Spoffy Spoffy added Attachments bug Something isn't working and removed bug Something isn't working labels May 5, 2025
@Spoffy
Copy link
Contributor Author

Spoffy commented May 5, 2025

This is dependent on the UI PR being merged.

This needs to be rebased once that is merged.

@Spoffy Spoffy changed the base branch from main to spoffy/archive-ui May 5, 2025 22:20
@Spoffy Spoffy marked this pull request as ready for review May 5, 2025 23:28
@berhalak berhalak self-assigned this May 6, 2025
@Spoffy Spoffy requested a review from paulfitz May 6, 2025 15:30
Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we need some tests on just adding and removing attachments and checking that stats update correctly, when those attachments are external. I see some attachment accounting tests in test/nbrowser/DocumentUsage.ts in grist-saas. If those aren't entangled too much with billing would it be worth bringing them across? Or are there already tests?

const attachments = this.docData?.getMetaTable("_grist_Attachments").getRecords();
for (const attachmentRec of attachments ?? []) {
const newSize = newFileSizesByFileIdent.get(attachmentRec.fileIdent);
if (newSize) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about checking if the size is different from the old recorded size? I believe the data engine will prune changes where a value is written to a cell that already contains that value, but seems easy to cut it out at the source here.

@berhalak berhalak self-requested a review May 9, 2025 08:39
import {server} from 'test/nbrowser/testServer';
import {setupTestSuite} from 'test/nbrowser/testUtils';

describe('DocumentUsage', function() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this test over using DIFF? I see that you've changed the content of this file, and I don't have tools to show me the what was changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can tag which bits are the same? I've done a restructuring on this file to make testing with external attachments easier, so I'm not sure a diff would show many useful things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a rebase that should make this easier.

This commit has all the DocumentUsage changes in:
c90b777

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New commit hash, had to do a rebase to sort a merge issue:
aacb5ee

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you maybe add another file for testing only external attachments and leave this one untouched?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these tests fail if untouched 😅

I can isolate and duplicate the attachments tests if you want, and ensure they pass under both external and internal.

But I'm also not seeing the harm in having a pure-core version of this file? Happy to discuss further.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion:

  • Several of the non-attachments related tests will be moved to a new PR
  • This file will be renamed

async function makeSessionAndLogin() {
// login() needs an options object passing to bypass an optimization that causes .login()
// to think we're already logged in when we're not after using `server.restart()`.
// Without this we end up with bad credentials.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you end up with old credentials, not bad. You can be logged into grist using multiple accounts at the same time, this is a feature :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a slight rephrasing.

The original session ends up with old credentials, you're right! But any new session ends up with bad credentials - specifically the bearer token ends up as api_key_for_chimpy, which just gives us the wrong responses from the API 😅

@Spoffy Spoffy force-pushed the spoffy/attachment-size-calc branch from e0a0f53 to 871b9e1 Compare May 12, 2025 14:19
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide whitespace changes for this diff - it'll make it much smaller. A lot of it was just indentation changes.

@Spoffy Spoffy force-pushed the spoffy/attachment-size-calc branch from 6e4d36f to b2c0483 Compare May 12, 2025 14:33
Base automatically changed from spoffy/archive-ui to main May 12, 2025 14:42
@Spoffy Spoffy force-pushed the spoffy/attachment-size-calc branch from b2c0483 to bc5ae34 Compare May 12, 2025 15:51
@Spoffy Spoffy requested a review from berhalak May 20, 2025 13:08
@@ -3618,6 +3620,59 @@ export function withEnvironmentSnapshot(vars: Record<string, any>) {
});
}

export function enableExternalAttachments(transferDelay?: string, preserveEnvVars = true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't found usage of the second argument, and this whole function looks very similar to the one above.

What do you think about:
Changing line 3611 to:
process.env[key] = typeof vars[key] === 'function' ? await vars[key]() : vars[key];

Changing this function to

export function enableExternalAttachments(transferDelay?: string) {
  withEnvironmentSnapshot({
    GRIST_EXTERNAL_ATTACHMENTS_MODE: 'test',
    GRIST_TEST_TRANSFER_DELAY: transferDelay ?? "",
    GRIST_TEST_ATTACHMENTS_DIR: async () => mkdtemp(path.join(await createTmpDir(), 'attachments'))
  });
}

And just reading process.env.GRIST_TEST_ATTACHMENTS_DIR whenever we need this information in tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main difference between this function and withEnvironmentSnapshot is it only touches specific variables, rather than the whole environment.

The reason is there's a lower chance of accidental issues / conflicts this way, such as when multiple environment snapshots are in use!

That's also why it has that second parameter - to disable the environment restore if it's causing problems.

Let me see if I can deduplicate this a bit though, if that's the main issue here.

Copy link
Contributor Author

@Spoffy Spoffy May 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've integrated all the behaviour I wanted into EnvironmentSnapshot and withEnvironmentSnapshot.

You can now pass them specific variables you want to save/set, and it'll save / restore only those variables, minimising the chance of conflicting changes.

The key reason I want this here specifically is because with gu.enableExternalAttachments, it's not apparent that the environment is being modified - increasing the risk of an accidental problem.

Copy link
Contributor

@berhalak berhalak Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so sorry for the comment above, I'm lost here a little bit with the logic you created. For me this was a simple stack, there was only one env at a time, it was a snapshot not a patch, there are no multiple environment in use (I don't know even what that means for you).

I don't get it, why do you need to touch only specific variables, and how do you create multiple process at the same time? I tested my proposition before, and it worked fine.

Can you revert those changes to the one you had before? Now it is very hard to understand what is going on here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, I've moved the helper out to its own file. :)

}
}

const action: BulkUpdateRecord = ['BulkUpdateRecord', '_grist_Attachments', rowIdsToUpdate, {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth checking if rowIdsToUpdate is empty and returning earlier (without all external calls)?

await this._applyUserActionsWithExtendedOptions(
docSession,
[action],
{ attachment: true },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{ attachment: true },
{attachment: true},

Saas linter will mark this. We have a rule: objects without spaces, scopes with spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the warning, might be worth bringing that rule over to core!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do, I already tried and failed, but don't remember the specifics.

import path from 'path';
import * as gu from 'test/nbrowser/gristUtils';
import {fileDialogUpload, TestUser} from 'test/nbrowser/gristUtils';
import {server, setupTestSuite} from 'test/nbrowser/testUtils';
import * as testUtils from 'test/server/testUtils';
import {setupTestSuite} from 'test/nbrowser/testUtils';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a refactor, am I right? I dont see any actual code change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - the uses for server and testUtils were replaced by a call to gu.enableExternalAttachments


it('shows usage stats on the raw data page', async function() {
await session.tempNewDoc(cleanup, "EmptyUsageDoc");
await testDocUsageStatsAreZero();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to inline this method. It is not used anywhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep this one as is - it's not used here right now, because I moved the extra tests to another branch.

When I open a PR for that, this will be used in multiple places again :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, we usually avoid doing this that way. If it is not used right now in multiple plays, lets introduced this abstraction later on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, will pull this out.

fileSize: newFileSizesForRows
}];

await this._applyUserActionsWithExtendedOptions(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about wrapping it with try catch ? Asking because, this will throw if ids are not valid, it is very unlikely but still possible if the docData we have in memory is not exactly the same as the one in data engine, everything here is async, so maybe it can somehow interfere with other modifications.

Ideally this would all be done in "transaction like" scope, but I don't know if it is important to worry about that here. We have some mechanism for locking the doc modifications, but maybe this is overkill here. But there is other mechanism available - the upsert action BulkAddOrUpdateRecord, where you can update those values based on the fileIdentity, not the ids.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a difficult question. The options seem to be:

  • Don't catch, fail to import the attachments
  • Catch, import the attachments with an incorrect file size

If this is very rare, I think I'd prefer to let it fail and force the user to re-import, than import attachments with an invalid file size.

I don't think upsert is an option unfortunately, as it's not correct to add a new record if it doesn't exist. Best case, we have an extra attachment being tracked that doesn't exist. Worst case, it breaks other code because it's not a fully valid record that's added (unless we try to reinsert the whole record...).

My inclination is to leave it, add a comment, see if it actually comes up in production.

How do other parts of the codebase handle the in-memory data and data engine being out of sync? Just locking?

Copy link
Contributor

@berhalak berhalak Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum was to wrap this in try catch, and rethrow something more meaningful, explaining what has happen.

I don't know for sure, but we have some data corruptions problems we don't know how to fix or don't know the reason. So here, we know that the ids can go wrong, so we can anticipate it and explain in more details.
Since we don't do any performance or real load tests, waiting for this to happen on prod is a wrong idea, we don't know anything about how this would work on prod.

I just checked the docstring for the BulkAddOrUpdateRecord, there is a on option parameter that disables adding if it can't find records to update, so I think it actually does look as a good option, wdyt? Here is a docstring:

   `options` is a dictionary with optional settings to choose other behaviours:
    - Set "on_many" to "all" or "none" to change which records are updated when several match.
    - Set "update" or "add" to False to disable updating or adding records respectively,
      i.e. if you only want to add records that don't already exist
        or if you only want to update records that do already exist.
    - Set "allow_empty_require" to True to allow `require` to be an empty dictionary,
      which would mean that every record in the table is matched.
      Otherwise this will raise an error to prevent mistakes like updating an entire column.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this, but apparently that method doesn't support metatables ☹️

Instead, I've wrapped it in a more specific error, and added some logging around that error too. How does that look?

gu.enableExternalAttachments();

async function makeSessionAndLogin() {
// login() needs an options object passing to bypass an optimization that causes .login()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not needed now, right? I think we fixed this issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was avoided in a different way, by not restarting the server except at the start of the test file.

I can remove this as a result, but at the same time, it would be nice to document this behaviour somewhere. But not sure where.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I meant, that we actually fixed this issue in other PR, but I can't find it now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My memory was that this was intended behaviour and we weren't changing it, but my memory might be wrong! :)

@berhalak berhalak self-requested a review June 2, 2025 08:31
Copy link
Contributor

@berhalak berhalak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are those tests failures related to this PR?

@Spoffy Spoffy requested a review from berhalak June 2, 2025 20:37
enableExternalAttachmentsForTestSuite
} from 'test/nbrowser/externalAttachmentsHelpers';

describe('DocumentUsage', function() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
describe('DocumentUsage', function() {
describe('DocUsageTracking', function() {

The convention is to have the same name as the file.

Comment on lines 51 to 54
await api.applyUserActions(docId, [['AddEmptyTable', "AttachmentsTable"]]);
await gu.getPageItem('AttachmentsTable').click();
await gu.waitForServer();
await addAttachmentColumn('Attachments');
Copy link
Contributor

@berhalak berhalak Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
await api.applyUserActions(docId, [['AddEmptyTable', "AttachmentsTable"]]);
await gu.getPageItem('AttachmentsTable').click();
await gu.waitForServer();
await addAttachmentColumn('Attachments');
await gu.sendActions([['AddEmptyTable', "AttachmentsTable"]]);
await gu.getPageItem('AttachmentsTable').click();
await gu.addColumn('Attachments', 'Attachment');

When you change the document through the API, you are not waiting for those change to propagate back to the browser. It works 99% of time, but introduce flakines.

The second waitForServer is not needed I think, is it?

Your helper for adding attachments column is doing the same thing that this call:
await gu.addColumn('Attachments', 'Attachment');

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the advice, will put in these changes.
Not entirely sure if it's needed or not. I'll double check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made these changes, you're 100% right, that second waitForServer isn't needed - gu.sendActions waits :)

await driver.find('.test-right-tab-field').click();
await gu.addColumn(columnName);
await gu.setType(/Attachment/);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helper is not needed, there is a type parameter in the addColumn, which does the same thing as you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to know, removed it :)

@Spoffy Spoffy requested a review from berhalak June 3, 2025 18:00
@Spoffy Spoffy merged commit 0d159d1 into main Jun 5, 2025
22 of 23 checks passed
@Spoffy Spoffy deleted the spoffy/attachment-size-calc branch June 5, 2025 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants